ABSTRACT

Essay and multiple-choice scores from Advanced 
Placement (AP) examinations in American History, European History, 
English Language and Composition, and Biology were matched with 
freshman grades in a sample of 32 colleges. Multiple-choice scores 
from the American History and Biology examinations were superior to 
essays for predicting overall grade point average, but essay scores 
were essentially equivalent to multiple-choice scores for predicting 
grades in history, English, and biology. In history courses, males 
and females received comparable grades and had nearly equal scores on 
the AP essays, but the multiple-choice scores of males were about 
half of a standard deviation higher than the scores of females. To 
the extent that the AP history examinations are intended to emulate 
performance in college history courses, placing greater weight on the 
essay component of the AP history examinations would reduce sex 
differences without compromising content or predictive validity. 
Eleven tables present study findings. Appendix A lists score 
conversions, and Appendix B presents an additional three tables of 
correlations of AP scores and grades. (Contains 14 references.)
(Author/SLD) 



* Reproductions supplied by EDRS are the best that can be made *
* from the original document. *



RR-91-48 









SEX DIFFERENCES IN THE RELATIONSHIP OF 
ADVANCED PLACEMENT ESSAY AND MULTIPLE-CHOICE 
SCORES TO GRADES IN COLLEGE COURSES 



Brent Bridgeman 
Charles Lewis 







Educational Testing Service 
Princeton, New Jersey 
August 1991 






Copyright© 1991. Educational Testing Service. All rights reserved. 




Acknowledgements 

Thanks to Leonard Ramist for assembling the database and supporting 
access to it. Programming and data analysis were ably performed by Annette
Turner.






Abstract 



Essay and multiple-choice scores from Advanced Placement (AP) 
examinations in American History, European History, English Language and 
Composition, and Biology were matched with freshman grades in a sample of 32 
colleges. Multiple-choice scores from the American History and Biology 
examinations were superior to essays for predicting overall GPA, but essay 
scores were essentially equivalent to multiple-choice scores for predicting
grades in history, English, and biology. In history courses, males and 
females received comparable grades and had nearly equal scores on the AP 
essays, but the multiple-choice scores of males were about half of a standard
deviation higher than the scores of females. To the extent that the AP 
history examinations are intended to emulate performance in college history 
courses, placing greater weight on the essay component of the AP history 
examinations would reduce sex differences without compromising content or 
predictive validity. 



Sex Differences in the Relationship of 
Advanced Placement Essay and Multiple-Choice Scores
to Grades in College Courses 

Essay examinations assess productive and organizational skills that
cannot be measured with multiple-choice questions, but they require
time-consuming and expensive scoring sessions that can be run only with trained
experts in the subject area of the examination. On the other hand,
multiple-choice tests are easy to score with machines. The two types of tests also
differ in their coverage of the content domain. Essay examinations usually
require an in-depth understanding of a few content areas while multiple-choice
examinations survey a broader range of topics. Because of measurement error
created by subjective scoring and by the relatively narrow coverage of the
content domain, essay tests are usually substantially less reliable than
multiple-choice tests in the same general subject area.

Previous research comparing essay and multiple-choice tests as measures
of writing ability suggests that the two kinds of assessment are largely 
overlapping, but that each also assesses some unique features (Breland & 
Gaynor, 1979; Quellmalz, Capell, & Chou, 1982; Breland, Camp, Jones, Morris, & 
Rock, 1987). However, much less is known about the relative contributions of 
the two types of assessment techniques for prediction of subject matter 
mastery as opposed to assessment of writing ability. 

One of the important, and currently unexplained, differences between 
multiple-choice and essay assessments of subject matter mastery is that, in
several different subject areas, essay assessments produce smaller sex
differences than do multiple-choice tests. Evidence from Great Britain
(Murphy, 1982), Australia (Bell and Hay, 1987), and Ireland (Bolger and
Kellaghan, 1990) consistently indicates that males have a relative advantage
on multiple-choice tests. Mazzeo, Schmitt, and Bleistein (1989) found a
similar male advantage on several of the Advanced Placement (AP) examinations 
that are taken by high school students who are seeking college credit or 
placement into advanced college courses based on college-level courses that
they have completed. In several different subject areas, males and females had 
nearly equal scores on the essay portion of the examination while males had 
significantly higher scores on the multiple-choice portion. This difference
remained even after correcting for the differential reliability of the two
question types, and removing items from the multiple-choice test on which
males did particularly well (i.e., high DIF items) had very little impact on
the observed sex differences. Differences were especially striking on the 
United States History examination, with estimated true score means for males 
and females essentially equivalent on the essays (difference of less than .02 
in standard deviation units) but with the mean for males more than .3 standard 
deviation units higher than the mean for females on the multiple-choice
portion of the test. This difference is particularly important because 
American History is one of the largest AP programs, testing over 50,000 
students a year. Similarly large differences were found on the European 
History examination. Smaller differences were found on the Biology 
examination (standardized difference on the essay section was .16 while on the
multiple-choice section the difference was .33) and on the English Language
and Composition examination (difference of .02 on the essay section and .17 on
the multiple-choice section).

In a study of the validity of the Advanced Placement Examination in 
Biology for predicting grades in college biology courses, Bridgeman (1989)
found that the essay and multiple-choice scores predicted grades about equally
well for males, but that for females predictions were superior using the 
multiple-choice scores. However, sample sizes were small and there was 
considerable variation among colleges. 

One purpose of the current study was to assess the validity of the essay
and multiple-choice sections of selected AP examinations for predicting 
success of males and females in college courses. It should be noted that the 
primary purpose of the AP examinations is to certify the acquisition of skills 
and knowledge taught in specific college-level courses taken in high school, 
not to predict success in those or similar courses taken in college. Thus,
content validity, not predictive validity, is of paramount importance. Nevertheless, a
reasonable part of the construct validity of such examinations is that success 
in the components of the examination should be correlated with success in 
related courses taken in college. If there were a major discrepancy between 
predictions from the essay or multiple-choice components of an examination (or 
between predictions for men and women on either component) it would not 
automatically mean that the offending component should be dropped, but it 
would suggest that it be closely scrutinized. A second purpose of the study 
was to determine whether the sex differences observed on the AP
multiple-choice scores (and hence composite scores) were reflected in similar sex
differences in college grades. 

The AP examinations selected for study were those with a relatively 
large discrepancy between the multiple-choice sex difference and the essay sex 
difference. An additional criterion was that the test be taken by enough 
students to have some hope of finding sufficient numbers of students for 
analysis when looking at particular courses within a college. Thus, for
example, the Physics B examination, which has a much larger sex difference on
the multiple-choice section than on the free-response section, was rejected
because too few students (especially female students) took it. The Calculus
AB and English Literature examinations, although taken by a large number of
males and females, were rejected because the sex differences on the
multiple-choice and free-response sections were virtually identical. The examinations
selected were American History, European History, English Language and 
Composition, and Biology. 

METHOD 

Sample 

Freshman grades from 45 colleges were obtained from a database that had 
been assembled to support a variety of validity studies. Colleges in which
fewer than five male and five female students had taken one of the target AP 
examinations in combination with the appropriate freshman courses were 
eliminated from the sample, resulting in a final sample of 32 colleges. The 
colleges in the database include both public and private institutions that use 
the Scholastic Aptitude Test (SAT) as part of the admissions process. The 
grades were from students who entered college in the fall of 1985. The 
database contains grades in individual courses as well as summary averages 
that group courses in related fields. For example, a history average 
represents the grade point average of a student in all of the history courses 
that the student took during the freshman year. In some cases, courses
intended for majors that included labs were separated from more general 
courses, and courses that are generally intended for more advanced students 
were separated from regular freshman courses. Grades were coded on a 13-point
scale (F = 0, D- = .7, D = 1, D+ = 1.3, C- = 1.7, ..., A = 4, A+ = 4.3). In addition to
grades, the database included SAT scores and self-reported high school grade
point average (HSGPA).

AP scores from the selected examinations were added to the database and 
matched by social security number. AP tests from 1984 and 1985 were included. 
Although most AP examinations are taken at the end of the senior year in high 
school (i.e., 1985 for students who began college in 1985), a notable 
exception is the American History examination, which is typically taken at the
end of the junior year because most students take American history as an 11th 
grade course. Of the 53,859 students in the original database of college 
grades, AP scores were located for 7626 students (about 14%); of these 7626 
students, 6243 took one AP examination, 1237 took two AP examinations, and the 
remaining 146 took three or more AP examinations.

AP test descriptions

United States History. The multiple-choice portion of this examination
consisted of 100 items with five answer choices for each item. Examinees were
allowed 75 minutes to answer these questions. This section was formula scored
(score = number right - 1/4 number wrong) with negative formula scores converted to
0. The essay portion of the test consisted of two essays. For the first
essay, examinees were provided with a set of documents and asked to construct
an argument based on the documents. In order to receive an above-average
score the candidate had to make reference to historical facts that were not
directly discussed in the documents provided. For the second essay, examinees
were asked to choose one of six thematic history questions that were
presented.

Each of the two essays was scored by a different reader using a 0-15
scale; thus, essay scores could range from 0 to 30. The composite score was
created by multiplying the multiple-choice formula score by .9, multiplying
the essay score by 3, and summing the two weighted scores. Thus, the two
sections received nominally equal weights in forming the composite score (each
contributed a maximum of 90 points), but because the standard deviation of the
multiple-choice section was slightly larger (14.8 vs. 12.7 in 1984) the
multiple-choice section actually had a slightly greater importance in the
determination of relative rankings on the composite score. The chief reader
and ETS professional staff then transformed the composite score to the 1-5
scale that was reported to colleges. This transformation was based to a large
extent on an equating of the multiple-choice scores on a given form with the
multiple-choice scores on an earlier form through a set of items common to
both forms. Appendix A contains a table for converting composite scores (used
in this report) to the 1-5 scores that are reported to colleges.
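
The scoring rules just described can be sketched in a few lines of Python. This is only an illustration of the arithmetic stated above (formula scoring with a floor of 0, then weights of .9 and 3); the function names and the example counts of right and wrong answers are hypothetical and are not taken from the study.

```python
def formula_score(num_right, num_wrong, num_choices=5):
    """Formula score: right answers minus wrong / (choices - 1).

    With five answer choices this is right - wrong/4. Negative
    results are converted to 0, as described in the text.
    """
    raw = num_right - num_wrong / (num_choices - 1)
    return max(raw, 0.0)


def us_history_composite(mc_formula_score, essay_total):
    """Composite = .9 * multiple-choice formula score + 3 * essay total.

    essay_total is the sum of the two essay scores (0 to 30), so each
    section contributes at most 90 points.
    """
    return 0.9 * mc_formula_score + 3 * essay_total


# Hypothetical examinee: 60 right, 20 wrong, 20 omitted; essays sum to 18.
mc = formula_score(num_right=60, num_wrong=20)
composite = us_history_composite(mc, essay_total=18)
print(mc, composite)
```

The maximum composite under these weights is 180 (90 points from each section), which is what "nominally equal" weighting refers to.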

Reliability of the multiple-choice scores, as estimated by KR-20, was
.90 in 1984 (Eignor, Flesher, and McClean, 1984) and .89 in 1985 (Livingston, 
McClean, and Flesher, 1985). The coefficient alpha reliability of the essay 
scores was .54 in both years; because each essay is read by a different 
reader, this estimate includes both differences among readers and differences 
among topics as sources of unreliability. Reader reliability alone was about 
.79. Correlation between the multiple-choice and essay sections was .48 in 
1984 and .53 in 1985. 

European History. The general format and scoring rules for this
examination were nearly identical to the American History examination except
that candidates were not expected to use outside knowledge in answering the
document-based essay question. The KR-20 reliability of the multiple-choice
score was .91 and the coefficient alpha reliability of the essay score was
.44. The correlation between the two sections was .50 (Mazzeo and Flesher,
1985a).

English Language and Composition. This examination consisted of 65
five-option multiple-choice items that were formula scored in the manner
described above. This portion of the exam took 75 minutes and contained two
types of questions. One type tested a student's ability to manipulate syntax
by recasting sentences, and the other type contained questions that asked the
student to analyze the rhetoric, style, and content of prose passages. For
the essay portion of the examination, the examinee was given a 15-minute
reading period, and then asked to answer three questions in 90 minutes. Each
question required a response in a different rhetorical mode. Each essay was
graded by a different reader using a 9-point scale, so raw scores ranged from
0 to 27. The composite score was created so that the multiple-choice section
contributed 40 percent and the essay section contributed 60 percent of the
maximum total score of 150 (multiply the multiple-choice score by .923,
multiply the essay score by 3.333, and sum). Reliabilities of the
multiple-choice and essay sections were .88 and .56 respectively. The correlation
between the two sections was .51 (Livingston, Karatka, and Bleistein, 1985).

Biology. In 1985, the 90-minute multiple-choice portion of this
examination consisted of 120 five-option items that were formula scored.
Three topics were assessed with 40 items on each topic: (A) Cellular and
Molecular, (B) Organismal, and (C) Populational. On the 75-minute essay
section there were three pairs of questions, one pair from each of the above
topics. The candidate was instructed to choose one question from each pair.
Each of the three essays was graded on a 15-point scale. Multiple-choice
scores were multiplied by .625 and essay scores were multiplied by 1.667 so
that the two portions of the examination made nominally equal contributions to
the total score of 150. Reliability of the multiple-choice section was .93
while the coefficient alpha reliability of the essay section was .66 (Mazzeo
and Flesher, 1985b). Reader reliability alone was about .85. The correlation
of essay and multiple-choice scores was .73.

RESULTS AND DISCUSSION 
Correlation of AP examinations with college GPAs 

For each college that contained at least five male and five female
students who took the relevant AP examination, a correlation was computed
between the overall college grade point average (GPA) and the essay and
multiple-choice AP scores. These within-college correlations were transformed
to z scores with Fisher's r to z transformation and were weighted by n - 3.
The weighted mean of the z scores was then transformed back to r. The
correlations are presented in Table 1. Except as noted for the 1984 American
History examination, the tests were administered in 1985 and reflected
performance at the end of the senior year in high school. For a particular
examination, the correlations for the multiple-choice section and essay
section were generated by the same people and may be readily compared to
determine whether one question type is a significantly better predictor of
overall GPA (Dunn & Clark, 1969). As indicated in the table, for Biology and
for both years of the American History test the multiple-choice scores are
significantly better predictors of GPA. Furthermore, for all practical
purposes the composite score for these examinations was no better than the
multiple-choice score by itself. Patterns within sex essentially replicated
the combined sex results although they sometimes fell short of statistical
significance in the smaller groups. For the European History and English
Language examinations the differences between predictions from multiple-choice
and essay questions were not significant, and the composite appeared to be
better than either section by itself.

Insert Table 1 about here
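
The pooling of within-college correlations described in this section (Fisher's r to z transformation, weights of n - 3, back-transformation to r) can be illustrated with a short Python sketch. The function name and the example (r, n) pairs are hypothetical; they are not values from Table 1.

```python
import math


def pooled_correlation(samples):
    """Average within-college correlations on Fisher's z scale.

    samples: iterable of (r, n) pairs, one pair per college, where r is
    the within-college correlation and n is that college's sample size.
    Each z = atanh(r) is weighted by n - 3, and the weighted mean z is
    transformed back to r with tanh.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for r, n in samples:
        z = math.atanh(r)      # Fisher's r to z transformation
        weight = n - 3         # inverse of the sampling variance of z
        weighted_sum += weight * z
        total_weight += weight
    return math.tanh(weighted_sum / total_weight)


# Hypothetical colleges: (correlation with GPA, number of students).
colleges = [(0.25, 40), (0.38, 120), (0.31, 65)]
print(round(pooled_correlation(colleges), 3))
```

Weighting by n - 3 gives larger colleges more influence, because the sampling variance of z is approximately 1/(n - 3).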

Across-examination comparisons must be made more cautiously because
people are not randomly assigned to take particular examinations, and the 
characteristics of students who choose to take the biology examination, for 
example, might be quite different from the students who choose to take 
American History. Given this caveat, there is a surprising degree of 
similarity among the examinations as predictors of GPA; across examinations 
the correlation of the composite score with GPA ranged from .31 to .36. 

Correlations of the AP examinations with grade point average only in
social science courses are presented in Table 2. The social science grade
average includes such courses as anthropology, psychology, and sociology taken
during the freshman year, but does not include courses in any of the target
disciplines (history, English, and biology). The pattern of correlations for
this set of courses was nearly identical to the pattern for the full freshman
grade point average.

Insert Table 2 about here

Correlation of AP history examinations with college history grades 

Within each college that contained at least five male and five female 
students who took the 1984 AP American History examination and at least one 
history course, a correlation was computed between the history grade point
average and the essay and multiple-choice AP scores. These within-college
correlations were averaged in the same manner described above. Similar 
procedures were followed for students who took the 1985 American History or 
European History examinations. 

Insert Table 3 about here 

The averaged correlations are presented in Table 3; Appendix B presents 
the correlations separately for each college in the sample. For prediction of 
history grades, the AP American History essays appear to be at least as good 
as the multiple-choice questions. Although it is difficult to prove the null
hypothesis, it should be noted that the standard error of the difference
between the '84 American History essay and multiple-choice correlations was
only .036 with the 95% confidence interval ranging from -.07 to .07. Results
for the European History examination were less clear because of the relatively
small sample size. The apparent advantage for the multiple-choice items on
this examination was not statistically significant. 

Correcting the essay score for unreliability (recall that the essay
score is much less reliable than the multiple-choice score) presents an even
stronger case for the potential value of an essay score. In the large '84
American History sample, the estimated correlation of a perfectly reliable
essay test with history grades (which were not corrected for unreliability)
was .44 while the correlation of a perfectly reliable multiple-choice test
with grades was .31 (see footnote 1). Assuming that the standard error of the corrected
scores was reasonably similar to the standard error of the uncorrected scores,
this would represent a significant advantage of a perfectly reliable essay
test over a perfectly reliable multiple-choice test for predicting history
grades.
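
A correction of this kind is commonly made with the classical correction for attenuation applied to the predictor only, that is, dividing the observed correlation by the square root of the predictor's reliability. The sketch below assumes that form; the function name and the numeric inputs are hypothetical and are not the within-college values used in the study.

```python
import math


def correct_for_predictor_unreliability(observed_r, predictor_reliability):
    """Estimate the correlation a perfectly reliable predictor would show.

    Only the predictor (test score) is corrected; the criterion (grades)
    is left uncorrected, matching the description in the text.
    """
    if not 0.0 < predictor_reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return observed_r / math.sqrt(predictor_reliability)


# Hypothetical inputs: observed r = .30, score reliability = .50.
print(round(correct_for_predictor_unreliability(0.30, 0.50), 2))
```

Because the divisor shrinks as reliability falls, the same observed correlation implies a larger corrected value for a less reliable score, which is why the correction favors the essay.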

Although sex differences in prediction for '84 American history and '85
European history appeared to be trivial, '85 American history appeared to be
anomalous with multiple-choice questions a significantly better predictor for
women; the sex difference for the essay showed a nonsignificant trend in the
opposite direction. No explanation of this difference was uncovered. There
were no changes in the test specifications from 1984 to 1985. Although the
essay topics were different in the two years, the discrepancy was noted for
both the document essay and the choice essay in 1985. The document essay
correlated .24 with history grades for males and .01 for females; the choice
essay correlated .29 for males and .06 for females. For the '84 examination,
there were no sex differences in prediction for the document essay (.28 vs
.23) or the choice essay (.22 vs .21). Differences in overall group ability
or homogeneity across the two years would not lead to the kind of sex by
question type (multiple-choice vs essay) interaction observed; in any event,
means and standard deviations were comparable in the two years. Because the
'85 American history sample is relatively small, it may be unwise to pay too
much attention to an apparent sex interaction that failed to replicate in a
sample that was more than twice as large.

1. In order to make the correction, within-college reliability estimates
for both multiple-choice and essay scores were needed. The within-college
reliability of the essay could be directly estimated with KR-20 because the
correlation between the two essay scores (document and choice) within each
college could be computed from the available scores and averaged over
colleges. However, the reliability of the multiple-choice scores could not be
directly determined because within-college item level data were not available.
But reliability estimates for both the essay and multiple-choice scores were
available from the national sample. The within-college reliability for the
multiple-choice score was estimated by assuming that the ratio of true score
variance in the college sample to true score variance in the national sample
is the same for both essay and multiple-choice tests. Because this ratio
could be computed for the essay score, and the reliability of the
multiple-choice score in the national sample was known, the reliability of the
within-college multiple-choice score could be estimated.
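
The footnoted procedure for estimating the within-college reliability of the multiple-choice score can be sketched as follows. The sketch assumes that true score variance equals reliability times observed score variance, and applies the stated assumption that the college-to-national ratio of true score variance is the same for both question types. All names and numbers are hypothetical; the report does not give the within-college variances.

```python
def estimate_within_college_mc_reliability(
    essay_rel_college, essay_var_college,
    essay_rel_national, essay_var_national,
    mc_rel_national, mc_var_national,
    mc_var_college,
):
    """Estimate within-college multiple-choice reliability.

    True score variance is taken as reliability * observed variance.
    The college/national ratio of true score variance is computed for
    the essay and assumed to hold for the multiple-choice score too.
    """
    essay_true_college = essay_rel_college * essay_var_college
    essay_true_national = essay_rel_national * essay_var_national
    ratio = essay_true_college / essay_true_national

    mc_true_national = mc_rel_national * mc_var_national
    mc_true_college = ratio * mc_true_national
    return mc_true_college / mc_var_college


# Hypothetical variances and reliabilities, for illustration only.
estimate = estimate_within_college_mc_reliability(
    essay_rel_college=0.5, essay_var_college=100.0,
    essay_rel_national=0.5, essay_var_national=200.0,
    mc_rel_national=0.9, mc_var_national=200.0,
    mc_var_college=120.0,
)
print(estimate)
```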

The relative advantage of essays for predicting history grades, as
contrasted with predicting overall GPA, may relate to the likelihood that
history courses will be graded with essay tests. A telephone survey of the
largest colleges in the sample confirmed that essay tests were always the
primary grading criterion in the history courses while other courses
frequently relied on multiple-choice examinations. The importance of the
similarity of assessment methods for the predictor and criterion is
underscored by a comparison of the AP scores with high school grade point
average and SAT verbal (SAT-V) and mathematics (SAT-M) scores. High school
grade point average is based to some extent on performance on multiple-choice
classroom tests but essay tests and other assessment methods are also
included; the SAT is strictly a multiple-choice test. Table 4 shows that for
the two largest samples ('84 American history and '85 biology) essay and
multiple-choice AP scores were about equally correlated with high school grade
point average (HSGPA), but AP multiple-choice scores were substantially more
related to SAT scores than were AP essay scores.

Insert Table 4 about here

Correlation of English Language AP scores with college English grades

The correlations of the scores from the AP English Language and
Composition examination with college English grades are presented in Table 5.
The results were very similar to the history findings; despite its lower
reliability, the essay score was at least as good a predictor as the
multiple-choice score in both sex groups, and the composite score was a slightly better
predictor than either of the individual scores. The reasoning that explained
the history results also applies here; when the criterion score (i.e., English
grade) is largely determined by the quality of student writing, then the score
on an essay examination is a relatively good predictor.

Insert Table 5 about here

Correlation of AP Biology with college biology grades

For this analysis only regular courses in the biological sciences that 
were normally open to freshman students were considered; courses that were 
specifically designated for students majoring in biology were excluded. As 
noted previously (Bridgeman, 1989), very few students use their AP experience 
to enroll in advanced biology courses during their freshman year. In the 
current set of colleges, a sample of only 43 students in advanced courses was 
identified. This sample was deemed to be too small for meaningful analysis. 
Also consistent with the previous study was the finding that students are 
often encouraged not to take biology courses before their sophomore year. 
Thus, for example, at one of the larger colleges in the sample 62 students 
were identified who had taken the AP biology examination, but none of these 
students was enrolled in a regular biology course.

As indicated in Table 6, correlations were comparable for the essay and
multiple-choice scores, and sex differences were not significant. The absence
of a difference between the correlation from the essay test and the
correlation from the multiple-choice test in the female sample failed to
replicate the finding of such a difference in an earlier study (Bridgeman,
1989). Combining the data on female students from both studies into a single
analysis (n = 370) resulted in a correlation with biology grades of .34 for the
essay score and .45 for the multiple-choice score with a standard error of the
difference of .06, thus just reaching the conventional standard of statistical
significance (z = 2.19, p < .05). But the finding is not very robust; removing
the single course with the largest difference in the first study (which was
also the largest course [n = 51]) from the combined sample resulted in
correlations that did not differ significantly (rs of .38 and .42 for the
essay and multiple-choice scores respectively). As noted in the earlier
study, there is substantial variation among colleges in these patterns of
correlations; in 13 of the independent samples of female students in the
combined studies the essay score was a better predictor than the
multiple-choice score while the opposite pattern was observed in the other 18 samples.

Insert Table 6 about here

Summary of correlational results 

When grades in history, English, or biology courses are the criterion, 
the essay scores from the corresponding AP examinations predict about as well 
as the multiple-choice scores despite their lower reliabilities. Sex
differences in correlational patterns appeared to be trivial, especially in 
the large English and '84 American history samples. For unknown reasons, the 
essay was a poor predictor for females in the relatively small '85 American 
history sample.

Mean differences by sex for AP history scores and college history grades

For each college with data from at least five male and five female
students, score means for the AP essay, multiple-choice, and composite scores
were computed separately by sex. The difference between these means (in 
pooled standard deviation units) was computed and weighted to give an unbiased 
estimate of the population value (see Hedges and Olkin, 1985 for an 
explanation of this procedure); this standard difference is called d in the 
tables. Arbitrarily, positive values of d indicate higher scores for males. 
Similarly, d was computed for the history grade average in each college. 
Again following procedures described by Hedges and Olkin, the weighted average 
of the ds was computed, the 95% confidence interval for this mean was 
determined, and the test statistic Q (which has an asymptotic chi-square 
distribution under the null hypothesis) was computed to test the 
reasonableness of the assumption that each college sample is estimating the 
same population effect size. 
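
The effect-size procedure summarized above can be sketched in Python. The sketch uses the widely cited approximations associated with Hedges and Olkin (1985): the small-sample correction factor 1 - 3/(4m - 1) with m = n_m + n_f - 2, the large-sample variance of d, inverse-variance weights, and the homogeneity statistic Q. Whether the authors used exactly these approximations is an assumption, and the function names and example numbers are hypothetical.

```python
import math


def hedges_d(mean_m, mean_f, sd_m, sd_f, n_m, n_f):
    """Return (d, variance of d) for one college.

    Positive d indicates a higher mean for males, as in the tables.
    """
    df = n_m + n_f - 2
    pooled_sd = math.sqrt(((n_m - 1) * sd_m ** 2 + (n_f - 1) * sd_f ** 2) / df)
    g = (mean_m - mean_f) / pooled_sd
    d = (1 - 3 / (4 * df - 1)) * g          # small-sample bias correction
    var_d = (n_m + n_f) / (n_m * n_f) + d ** 2 / (2 * (n_m + n_f))
    return d, var_d


def combine_effects(effects):
    """Combine per-college (d, variance) pairs.

    Returns the weighted mean d, its 95% confidence interval, and Q,
    which is referred to a chi-square distribution with k - 1 degrees
    of freedom (k = number of colleges).
    """
    weights = [1 / var for _, var in effects]
    d_mean = sum(w * d for w, (d, _) in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    ci = (d_mean - 1.96 * se, d_mean + 1.96 * se)
    q = sum(w * (d - d_mean) ** 2 for w, (d, _) in zip(weights, effects))
    return d_mean, ci, q


# Hypothetical data for three colleges (means, SDs, and group sizes).
per_college = [
    hedges_d(62.0, 55.0, 14.0, 15.0, 30, 25),
    hedges_d(58.0, 53.0, 13.0, 14.0, 12, 18),
    hedges_d(65.0, 57.0, 15.0, 15.0, 40, 45),
]
d_mean, ci, q = combine_effects(per_college)
print(round(d_mean, 2), [round(x, 2) for x in ci], round(q, 2))
```

A small Q relative to its degrees of freedom is consistent with the assumption that every college is estimating the same population effect size.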

Means, standard deviations, and ds for the '84 AP American history 
examination and college history grades are presented in Table 7. Consistent 

Insert Table 7 about here 

with Mazzeo et. al. (1989), sex differences on the multiple -choice score were 
substantially larger than on the essay score. Course grades of males and 
females did not differ significantly (95% confidence interval from -.22 to 
.04). Note that the confidence intervals for the multiple -choice score 
difference and the composite score difference do not overlap with the 
confidence interval for the course grade difference, suggesting that these 
scores underestimate the ability of females to perform in college history
courses. Given that, as already noted, college history courses are primarily
graded with essay tests, this result is not surprising, but its practical
significance should not be ignored. To put this difference in more concrete
terms, suppose you had a special award that was to be given for above-average
performance in history. If that award were based on college history grades or
AP essay scores, equal numbers of males and females would be recognized. But
if it were based on AP multiple-choice scores and half of the males received
the award, then only about one-third of the females would receive it.
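
The arithmetic behind this illustration can be sketched as follows. This is a minimal sketch that assumes normally distributed scores with equal standard deviations for both sexes; the function name and the values of d tried below are illustrative assumptions, not figures taken from Table 7.

```python
from statistics import NormalDist

def female_award_rate(d, male_rate=0.5):
    """Share of females above a cutoff that male_rate of the males exceed.

    Assumes normal score distributions with equal standard deviations and a
    female mean that lies d standard deviations below the male mean.
    """
    standard_normal = NormalDist()
    # Cutoff expressed in SD units relative to the male mean.
    cutoff = standard_normal.inv_cdf(1 - male_rate)
    return 1 - standard_normal.cdf(cutoff + d)

for d in (0.0, 0.43, 0.5):  # illustrative values of d, not figures from Table 7
    print(f"d = {d:.2f}: {female_award_rate(d):.2f} of females receive the award")
```

Under these assumptions, a standardized difference of a little over four-tenths of a standard deviation is enough to reduce the female award rate from one-half to about one-third.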

It might be argued that the history grades of males were artificially 
low because they selected more difficult history courses. However, this was 
not the case; analyses at the individual course level (i.e., comparing males 
and females enrolled in the same course) revealed the same pattern as at the 
history grade average level. 

The large sex difference observed for the multiple-choice questions on
the '84 American history examination does not suggest that males would display
the same advantage on all multiple-choice tests. Scores on the verbal portion
of the Scholastic Aptitude Test (SAT-V) were available for all of the students
in the '84 American history sample; scores were only slightly higher for males
(d = .15), and the 95% confidence interval (.03 to .28) did not overlap with the
confidence interval for the difference in the history multiple-choice scores.
Thus, the sex difference on the AP multiple-choice history questions was 
significantly larger than the sex difference on SAT-V. 

Tables 8 and 9 indicate that the same general pattern of sex differences
found in the '84 American history sample was also found in the smaller '85
American history and European history samples. 

Mean differences by sex for AP English Language scores and English grades 

As indicated in Table 10, there were no statistically significant sex
differences on any of the AP English Language and Composition scores or on
college English grades, although there was a nonsignificant trend (confidence
interval -.42 to .05) for the females to outperform males on the essay. This
is consistent with the relatively modest question type by sex interaction
noted by Mazzeo et al. (1989).

Insert Table 10 about here

Mean differences by sex for AP Biology scores and biology grades 

Table 11 shows that, among the various indicators of performance in
biology, males significantly outperformed females only on the multiple-choice
AP scores, but the maximum difference in the ds (between the multiple-choice
score and course grade) was only .23. For comparison, note that a difference
of at least .57 between the multiple-choice d and the grade d was observed in
all three history samples.

Insert Table 11 about here

CONCLUSIONS 

In terms of correlations with college freshman GPA, multiple-choice
scores from the AP American History and Biology examinations appeared to be
superior to the essays as predictors. When grades in history, English, and
biology were the criteria, AP essay and multiple-choice scores (from the
subject-appropriate examination) were about equally good predictors, and
overall sex differences in the magnitude of the correlations were trivial. 
But, just because two scores have approximately equal correlations with a 
criterion does not necessarily mean that it is a matter of indifference which 
score is used or how the scores are weighted to form a composite. Thus, for 
example, female history students would be disadvantaged by a greater reliance 
on multiple-choice scores both in terms of absolute numbers identified and in
terms of the number identified holding ultimate success in college history
courses constant. Indeed, the current results suggest that more weight should
be placed on the history essays to achieve more nearly sex-fair selections for
granting college credit or advanced placement. This argument assumes that 
college grades are themselves sex fair, but it could also be argued that 
college history grades are biased against males because they rely largely on 
essay examinations on which males do poorly relative to their performance on 
multiple-choice tests. However, to the extent that AP examinations should 
reflect what actually occurs in college classes, this argument is irrelevant. 

If more weight were placed on the essay sections, efforts should be made 
to improve their reliabilities. The greatest improvement could be made by
increasing both the number of essays written by the student and the number of 
readers rating each essay. Increasing either component by itself should have 
some positive impact on score reliability. Some improvement may also be 
possible by making statistical adjustments for systematic differences in the 
scoring standards of different raters (Braun, 1988).

Future research should include a survey of college grading practices to 
identify the extent to which grades in various courses are determined by
in-class essays, multiple-choice tests, or other assessment techniques. In
addition, studies should be performed that could rule out construct-irrelevant
influences as the determiners of sex differences on both essay and
multiple-choice scores. For example, the generally neater handwriting of females could
be discounted as an explanation of their relatively strong performance on 
essay tests if the same size sex difference were observed with typed essays. 
For the multiple-choice tests, a possible greater willingness of male 
examinees to guess when uncertain could be evaluated with a test that 
contained no penalty for guessing. Finally, if the sex differences on the two 
types of tests appear to be caused by true differences on the somewhat 
different constructs assessed by the two formats, then research should focus 
on understanding how these differences developed. Educational strategies 
might then be developed that could optimize performance for both sexes on 
essay and multiple-choice tests.

Appendix A

Conversion from Composite Score to 1-5 Score

                          Composite                  % of students
AP Examination            Score Range    1-5 Score   taking test

'84 American History        0-50             1
                           51-77             2            28
                           78-94             3            27
                           95-117            4            28
                          118-180            5             6

American History            0-47             1
                           48-73             2            29
                           74-89             3            27
                           90-111            4            26
                          112-180            5            11

European History            0-60             1
                           61-77             2            16
                           78-103            3            39
                          104-123            4            24
                          124-180            5

English Language            0-49             1            15
and Composition            50-73             2
                           74-89             3            34
                           90-101            4            30
                          102-150            5             4

Biology                     0-35             1
                           37-54             2            19
                           55-76             3            31
                           77-96             4            24
                           97-150            5            14

Note: Except for the 1984 American history examination, all data are for the 1985
examinations.

Detailed Tables for Correlations of
AP History Examinations and Grades

Table 3



Correlation of AP History Tests with College History Grades

                                '84 American   '85 American   European
Gender      Test/n              History        History        History

Combined    Essay                   .29            .23           .22
            Multiple-Choice         .28            .24           .31
            Composite               .35            .29           .32
            n (students)            991            342           240
            n (colleges)             18             11             7

Male        Essay                   .30            .33           .25
            Multiple-Choice         .28            .19           .36
            Composite               .36            .31           .39
            n (students)            546            190           160

Female      Essay                   .28            .05*          .14
            Multiple-Choice         .31            .35           .29
            Composite               .35            .27           .23
            n (students)            445            152            80

*Significant (p < .05) difference between essay and multiple-choice correlations
with grades.

Table 4



Correlation of AP History and Biology scores with SAT-V and HSGPA

                        AP Scores

             '84 History          Biology
            Essay     M-C      Essay     M-C

SAT-V        .25      .53       .28      .53
SAT-M        .10      .29       .23      .43
HSGPA        .16      .20       .15      .19

Note: History n = 2837; biology n = 1104